feat: add --dry-run estimation mode #1592

Open
mvanhorn wants to merge 3 commits into intel:main from mvanhorn:osc/feat-dry-run

Conversation

@mvanhorn

Summary

Adds a --dry-run flag to the CLI that estimates VRAM usage, output file size, and approximate quantization time without running the full quantization process.

  • Loads only the model config via AutoConfig.from_pretrained() (no weights downloaded)
  • Estimates peak VRAM from parameter count, dtype, batch size, and sequence length
  • Estimates output file size from target bit width, parameter count, and group size overhead
  • Estimates time from layer count, iterations, and calibration batch count
  • Prints a formatted summary table and exits
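The VRAM arithmetic behind an estimate like this can be sketched as below. The function name and constants are illustrative, not the actual auto_round/estimation.py API:

```python
def estimate_peak_vram_gb(param_count: int, dtype_bytes: int = 2,
                          batch_size: int = 8, seqlen: int = 2048,
                          hidden_size: int = 4096) -> float:
    """Very rough peak-VRAM estimate: full model resident in memory, plus
    activations for one tuned block, plus a ~20% CUDA overhead buffer."""
    model_bytes = param_count * dtype_bytes
    # activations for one decoder block: a handful of (batch, seqlen, hidden)
    # tensors; the factor of 4 is a coarse stand-in for attention/MLP temps
    act_bytes = batch_size * seqlen * hidden_size * dtype_bytes * 4
    return (model_bytes + act_bytes) * 1.2 / 1e9
```

For a 6.61B-parameter fp16 model with the example settings this lands in the mid-to-high teens of GB, the same ballpark as the example output in this PR.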

Motivation

Users quantizing large models (70B+) need to know resource requirements before committing GPU hours. This is relevant to #1551 (reduce quant cost) and #1584 (peak VRAM tracking).

Example output

============================================================
  AutoRound Dry-Run Estimation
============================================================
  Model:              meta-llama/Llama-2-7b-hf
  Parameters:         6.61B
  Layers:             32
  Target bits:        4
  Group size:         128
  Model dtype:        float16
============================================================
  Estimated peak VRAM:    17.80 GB
  Estimated output size:  3.64 GB
  Estimated time:         3.4 hours
    (batch_size=8, seqlen=2048, nsamples=128, iters=200)
============================================================
  NOTE: These are rough estimates. Actual values depend on
  hardware, model architecture, and runtime conditions.
============================================================
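The 3.4-hour figure is consistent with a per-layer, per-iteration heuristic scaled by the calibration batch count. This reconstruction is inferred from the printed numbers and the description above, not quoted from the module:

```python
SECS_PER_LAYER_PER_ITER = 0.12  # A100 / 7B-class heuristic quoted in the PR

def estimate_hours(num_layers: int, iters: int,
                   nsamples: int, batch_size: int) -> float:
    batches_per_iter = nsamples / batch_size   # 128 / 8 = 16
    total_secs = (num_layers * iters * batches_per_iter
                  * SECS_PER_LAYER_PER_ITER)
    return total_secs / 3600

# 32 layers * 200 iters * 16 batches * 0.12 s ≈ 12288 s ≈ 3.4 hours
```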

Changes

  • New: auto_round/estimation.py - VRAM, disk, and time estimation functions
  • Modified: auto_round/__main__.py - --dry_run / --dry-run CLI flag, short-circuits before model loading
  • New: test/test_cpu/core/test_estimation.py - unit tests for all estimation functions

Testing

All estimation unit tests pass (parameter counting, VRAM estimation, output size calculation, time estimation, format helpers). Tests use stub configs to avoid model downloads.

Fixes #1591

This contribution was developed with AI assistance (Claude Code).

mvanhorn and others added 2 commits March 21, 2026 23:05
Add a --dry-run flag to the CLI that estimates VRAM usage, output file
size, and approximate quantization time without running the full
quantization process. Uses AutoConfig to load model architecture
metadata without downloading weights.

New module: auto_round/estimation.py with estimation functions for
parameter count, peak VRAM, output size, and time.

Relates to intel#1551 and intel#1584
Fixes intel#1591

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
Refactor _count_parameters into smaller helpers to reduce local
variable count. Convert dry_run_estimate to use **kwargs and
extract helpers for config loading and result building.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
@wenhuach21 wenhuach21 requested review from n1ck-guo and xin3he March 22, 2026 07:37
hidden_size^2 * num_layers heuristic when fields are missing.
"""
hidden = getattr(config, "hidden_size", None)
num_layers = getattr(config, "num_hidden_layers", None)
@wenhuach21 (Contributor):

We typically perform block-wise tuning. By a "block" we mean a decoder layer, which for non-MoE models usually contains 6–7 linear layers.
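For a LLaMA-style (non-MoE) decoder block, those 6–7 linears are the four attention projections plus the three MLP projections. A sketch of the per-block parameter count under that assumption (ignoring GQA, biases, and norm weights):

```python
# Linear submodules in one LLaMA-style decoder block
BLOCK_LINEARS = ["q_proj", "k_proj", "v_proj", "o_proj",
                 "gate_proj", "up_proj", "down_proj"]   # 7 linears

def params_per_block(hidden_size: int, intermediate_size: int) -> int:
    attn = 4 * hidden_size * hidden_size        # q, k, v, o projections
    mlp = 3 * hidden_size * intermediate_size   # gate, up, down projections
    return attn + mlp

# Llama-2-7B (hidden 4096, intermediate 11008):
# 32 * params_per_block(4096, 11008) ≈ 6.48B, close to the 6.61B total
# once embeddings are added.
```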

- CUDA overhead and fragmentation (~20% buffer)
"""
# Model weights
model_bytes = param_count * model_dtype_bytes
@wenhuach21 (Contributor):

We need to cache some input data for the block when "low_gpu_mem_usage" is not enabled.
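A sketch of the extra memory term this implies, assuming all calibration samples' hidden-state inputs to the current block stay resident on device (names are illustrative):

```python
def cached_block_input_gb(nsamples: int, seqlen: int,
                          hidden_size: int, dtype_bytes: int = 2) -> float:
    """Memory to cache every calibration sample's input to the current
    block when low_gpu_mem_usage is disabled (rough; hidden states only)."""
    return nsamples * seqlen * hidden_size * dtype_bytes / 1e9

# With the example settings (128 samples, 2048 tokens, 4096 hidden, fp16)
# this is about 2.1 GB on top of the model weights.
```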


# Rough seconds per layer per iteration, measured on A100 for a 7B-class model.
# Actual speed varies widely by hardware and model architecture.
_SECS_PER_LAYER_PER_ITER = 0.12
@xin3he (Contributor):

Can we use a dummy block to measure the real performance of the current machine?
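One way to act on this suggestion: time a stand-in step on the current machine instead of relying on the hard-coded 0.12 s constant. The harness below is a generic sketch; the real version would run forward/backward on a dummy decoder block, which is stubbed out here:

```python
import time

def benchmark_step(step_fn, warmup: int = 2, reps: int = 5) -> float:
    """Return mean wall-clock seconds per call of step_fn on this machine."""
    for _ in range(warmup):          # warm caches / CUDA context / JIT
        step_fn()
    start = time.perf_counter()
    for _ in range(reps):
        step_fn()
    return (time.perf_counter() - start) / reps

# Hypothetical usage — run_one_dummy_block_iteration is not a real function:
# measured = benchmark_step(run_one_dummy_block_iteration)
# est_secs = measured * num_layers * iters * calib_batches
```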


# Optimizer state: roughly 2x one block's parameters (momentum + variance for Adam)
# Approximate one block as total_params / num_layers
block_overhead = model_bytes * 0.05 # ~5% of model for one block's optimizer state
@xin3he (Contributor):

card_0_used_memory = block_input_output_memory + layer_activation_memory + additional_memory

I have summarized the key points regarding block_overhead here, and I hope this proves insightful for you.
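Following the Adam-state reasoning in the quoted snippet (and as an alternative to its flat 5% heuristic), one block's optimizer overhead can be sketched as two fp32 state tensors per tuned parameter. Treating a whole block's weights as tuned gives a rough upper bound:

```python
def block_optimizer_state_gb(total_params: int, num_layers: int,
                             state_dtype_bytes: int = 4) -> float:
    """Upper-bound Adam state (exp_avg + exp_avg_sq, fp32) for one block,
    approximating one block as total_params / num_layers."""
    block_params = total_params / num_layers
    return 2 * block_params * state_dtype_bytes / 1e9

# Llama-2-7B: 6.61e9 / 32 params per block -> about 1.65 GB of Adam state.
# AutoRound tunes far fewer parameters than a full block, so the real
# number should be well below this bound.
```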

hidden_size^2 * num_layers heuristic when fields are missing.
"""
hidden = getattr(config, "hidden_size", None)
num_layers = getattr(config, "num_hidden_layers", None)
@xin3he (Contributor):

num_hidden_layers may not cover many model configs; Claude could help refine it.
By the way, we may need special handling for MoE models.
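A sketch of a more tolerant lookup that tries the layer-count field names used by common Hugging Face config classes. The list is illustrative, and MoE specifics (e.g. expert counts) are not handled:

```python
_LAYER_COUNT_FIELDS = (
    "num_hidden_layers",  # llama, mistral, qwen2, ...
    "n_layer",            # gpt2
    "n_layers",           # mpt
    "num_layers",         # chatglm, t5
)

def get_num_layers(config):
    """Try common HF config field names in turn; None if none match."""
    for name in _LAYER_COUNT_FIELDS:
        value = getattr(config, name, None)
        if isinstance(value, int):
            return value
    return None  # caller falls back to a heuristic or raises
```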

@mvanhorn (Author):

Thanks for the detailed feedback on the estimation approach.

@wenhuach21 Good point on block-wise tuning and the input caching overhead. I'll update the estimation to account for per-block input/output caching when low_gpu_mem_usage is disabled.

@xin3he The dummy block idea for real machine benchmarking is interesting - that would give more accurate estimates than extrapolation. I'll look into the block_overhead breakdown you linked and refine the estimation to handle MoE models separately. The num_hidden_layers limitation is a fair point.

